43 research outputs found

Towards unsupervised learning of speech features in the wild

Author: Dupoux Emmanuel
Rivière Morgane
Publication venue: HAL CCSD
Publication date: 13/12/2020

Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low-quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments and the perplexity of a model trained on clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low-resource setting.

INRIA a CCSD electronic archive server

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Author: Douze Matthijs
Dupoux Emmanuel
Kharitonov Eugene
Mazaré Pierre-Emmanuel
Rivière Morgane
Synnaeve Gabriel
Wolf Lior
Publication date: 02/07/2020

Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server
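
The additive-noise augmentation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the WavAugment API: the function names `power` and `mix_at_snr`, the synthetic sine "speech", and the 10 dB target are all assumptions chosen for illustration. The sketch only shows how a noise signal can be scaled so that the mixture reaches a requested signal-to-noise ratio in the time domain.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not part of any library.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # snr_db = 10 * log10(P_speech / (gain**2 * P_noise)), solved for gain.
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy data (assumed): one second of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Subtracting `speech` from `noisy` recovers the scaled noise, so the achieved SNR can be checked directly against the target.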

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Author: Dupoux Emmanuel
Haziza Daniel
Lee Ann
Pino Juan
Rivière Morgane
Talnikar Chaitanya
Wang Changhan
Williamson Mary
Wu Anne
Publication date: 27/07/2021

We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semi-supervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 16 languages and their aligned oral interpretations into 5 other languages totaling 5.1K hours. We provide speech recognition baselines and validate the versatility of VoxPopuli unlabelled data in semi-supervised learning under challenging out-of-domain settings. We will release the corpus at https://github.com/facebookresearch/voxpopuli under an open license. Comment: Accepted to ACL 2021 (long paper)

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Author: Bergelson Elika
Boissonnet Alodie
Bredin Hervé
Copet Jade
Cristia Alejandrina
Dupoux Emmanuel
Lavechin Marvin
Métais Marianne
Rivière Morgane
Titeux Hadrien
Publication date: 27/10/2022

Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI's Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.

arXiv.org e-Print Archive

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

Author: Douze Matthijs
Dupoux Emmanuel
Kharitonov Eugene
Mazaré Pierre-Emmanuel
Rivière Morgane
Synnaeve Gabriel
Wolf Lior
Publication venue: HAL CCSD
Publication date: 13/12/2020

Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification tasks by 12-15% relative.

INRIA a CCSD electronic archive server

VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Author: Dupoux Emmanuel
Haziza Daniel
Lee Ann
Pino Juan
Rivière Morgane
Talnikar Chaitanya
Wang Changhan
Williamson Mary
Wu Anne
Publication venue: HAL CCSD
Publication date: 02/08/2021

We introduce VoxPopuli, a large-scale multilingual corpus providing 400K hours of unlabeled speech data in 23 languages. It is the largest open data to date for unsupervised representation learning as well as semisupervised learning. VoxPopuli also contains 1.8K hours of transcribed speeches in 15 languages and their aligned oral interpretations into 15 target languages totaling 17.3K hours. We provide speech recognition (ASR) baselines and validate the versatility of VoxPopuli unlabeled data in semisupervised ASR and speech-to-text translation under challenging out-of-domain settings. The corpus is available at https://github.com/facebookresearch/voxpopuli.

INRIA a CCSD electronic archive server

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Author: Collobert Ronan
Dupoux Emmanuel
Fuegen Christian
Joulin Armand
Kahn Jacob
Karadayi Julien
Kharitonov Evgeny
Likhomanenko Tatiana
Liptchinsky Vitaliy
Mazaré Pierre-Emmanuel
Mohamed Abdelrahman
Rivière Morgane
Synnaeve Gabriel
Xu Qiantong
Zheng Weiyi
Publication venue: HAL CCSD
Publication date: 20/12/2019

We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art

The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

Author: Baevski Alexei
De Seyssel Maureen
Dunbar Ewan
Dupoux Emmanuel
Kharitonov Evgeny
Nguyen Tu Anh
Rivière Morgane
Rozé Patricia
Publication venue: HAL CCSD
Publication date: 01/12/2020

14 pages, including references and supplementary material. We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a composite baseline made of the concatenation of three unsupervised systems: self-supervised contrastive representation learning (CPC), clustering (k-means) and language modeling (LSTM or BERT). The language models learn on the basis of the pseudo-text derived from clustering the learned representations. This simple pipeline shows better-than-chance performance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech. It also yields worse performance compared to text-based 'topline' systems trained on the same data, delineating the space to be explored by more sophisticated end-to-end models.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server
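
The "pseudo-text" step described in the Zero Resource Speech Benchmark 2021 abstract above can be sketched as nearest-centroid quantization: each frame of learned features is replaced by the index of its closest k-means centroid, and the resulting sequence of discrete units is what a language model would train on. The sketch below is an illustration only, not the benchmark's baseline code: the function names, the two-dimensional toy frames, and the three hand-picked centroids are assumptions (real features would come from a trained encoder and real centroids from running k-means).

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def to_pseudo_text(frames, centroids):
    """Map a sequence of feature frames to a sequence of discrete unit IDs."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy values (assumed): three centroids and four 2-D feature frames.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.1, 0.8), (0.05, 0.9)]
units = to_pseudo_text(frames, centroids)
print(units)  # [0, 1, 2, 2]
```

The unit sequence plays the role that characters or word pieces play in a text corpus, which is why a standard language model can be trained on it without transcriptions.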